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Abstract 

In modern biology, one of the most important research problems is to understand how protein 
sequences fold into their native 3D structures. To investigate this problem at a high level, one 
wishes to analyze the protein landscapes, i.e., the structures of the space of all protein sequences 
and their native 3D structures. Perhaps the most basic computational problem at this level is 
to take a target 3D structure as input and design a fittest protein sequence with respect to one 
or more fitness functions of the target 3D structure. We develop a toolbox of combinatorial 
techniques for protein landscape analysis in the Grand Canonical model of Sun, Brem, Chan, and 
Dill. The toolbox is based on linear programming, network flow, and a linear-size representation 
of all minimum cuts of a network. It not only substantially expands the network flow technique 
for protein sequence design in Kleinberg's seminal work but also is applicable to a considerably 
broader collection of computational problems than those considered by Kleinberg. We have used 
this toolbox to obtain a number of efficient algorithms and hardness results. We have further used 
the algorithms to analyze 3D structures drawn from the Protein Data Bank and have discovered 
some novel relationships between such native 3D structures and the Grand Canonical model. 



*Department of Computer Science, Yale University, New Haven, CT 06520-8285, USA. Email: aspnes@cs.yale.edu. 
Supported in part by NSF Grant CCR-9820888. 

^Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT 06520-8285, USA. Email: 
julia.kreychmanOyale . edu. 

* Department of Electrical Engineering and Computer Science, Tufts University, Medford, MA 02155, USA. Email: 
kao@eecs.tufts.edu. Supported in part by NSF Grant CCR-9531028. 

^Department of Ecology and Evolutionary Biology, Department of Molecular, Cellular, and Developmental Biology, 
and Department of Statistics, Yale University, New Haven, CT 06520-8285, USA. Email: junhyong.kim@yale.edu. 
Supported in part by Merck Genome Research Institute Grant and NSF Grant DEB-9806570. 

^Department of Computer Science, Yale University, New Haven, CT 06520-8285, USA. Email: gauri . shah@yale . edu. 



1 



1 Introduction 



In modern biology, one of the most important research problems is to understand how protein se- 
quences fold into their native 3D structures [23|. This problem can be investigated at two comple- 
mentary levels. At a low level, one wishes to determine how an individual protein sequence folds. 
A fundamental computational problem at this level is to take a protein sequence as input and find 
its native 3D structure. This problem is sometimes referred to as the protein structure prediction 
problem and has been shown to be NP-hard (see, e.g., H || ) . At a high level, one wishes to ana- 
lyze the protein landscapes, i.e., the structures of the space of all protein sequences and their native 
3D structures. Perhaps the most basic computational problem at this level is to take a target 3D 
structure as input and ask for a fittest protein sequence with respect to one or more fitness functions 
of the target 3D structure. This problem has been called the protein sequence design problem and 
has been investigated in a number of studies @, @, |, |l|, |7|, HH, H ■ 

The focus of this paper is on protein landscape analysis, for which several quantitative models have 
been proposed in the literature [0, ^0[^]. As some recent studies on this topic have done ||, 18, 24 1, 
this paper employs the Grand Canonical (GC) model of Sun, Brem, Chan, and Dill [33|, whose 
definition is given in Section Generally speaking, the model is specified by (1) a 3D geometric 
representation of a target protein 3D structure with n amino acid residues, (2) a binary folding code in 
which the amino acids are classified as hydrophobic (H) or polar (P) |jq,|i~9||, and (3) a fitness function 
$ defined in terms of the target 3D structure that favors protein sequences with a dense hydrophobic 
core and with few solvent-exposed hydrophobic residues. 

In this paper, we develop a toolbox of combinatorial techniques for protein landscape analysis 
based on linear programming, network flow, and a linear-size representation of all minimum cuts of 
a network Pq|. This toolbox not only substantially expands the network flow technique for protein 



sequence design in Kleinberg's seminal paper [18] but also is applicable to a considerably broader 
collection of computational problems than those considered by Kleinberg. We have used this toolbox 
to obtain a number of efficient algorithms and hardness results. We have further used the algorithms 
to analyze 3D structures drawn from Protein Data Bank at http://www.rcsb.org/pdb and have 
discovered some novel relationships between such native 3D structures and the Grand Canonical 
model (Figure [l]). Specifically, we report new results on the following problems, where A is the 
number of terms in the fitness function or functions as further defined in Section 3.1. Many of the 
results depend on computing a maximum network flow in a graph of size 0(A); in most cases, this 
network flow only needs to be computed once for each fitness function <£. 



PI Given a 3D structure, find all its fittest protein sequences. Note that there can be exponentially 
many fittest protein sequences. We show that these protein sequences together have a represen- 
tation of size O(A) that can be computed in O(A) time after a certain maximum network flow 
is computed (Theorem ||), and that individual fittest protein sequences can be generated from 
this representation in 0(n) time per sequence (Theorem |9|) . 

P2 Given / 3D structures, find the set of all protein sequences that are the fittest simultaneously 
for all these 3D structures. This problem takes 0(A) time after / maximum network flow 
computations (Theorem |8|) . 

P3 Given a protein sequence x and its native 3D structure, find the set of all fittest protein sequences 
that are also the most (or least) similar to x in terms of unweighted (or weighted) Hamming 
distances. This problem takes O(A) time after a certain maximum network flow is computed 
(Theorem |). 
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P4 Count the number of protein sequences in the solution to each of Problems PI, P2, and P3 
These counting problems are computationally hard (Theorem 17). 



P5 Given a 3D structure and a bound e, enumerate the protein sequences whose fitness function 
values are within an additive factor e of that of the fittest protein sequences. This problem takes 
polynomial time to generate each desired protein sequence (Theorem IB 



P6 Given a 3D structure, determine the largest possible unweighted (or weighted) Hamming dis- 
tance between any two fittest protein sequences. This problem takes O(A) time after a certain 



maximum network flow is computed (Theorem 10) 



P7 Given a protein sequence x and its native 3D structure, find the average unweighted (or weighted) 
Hamming distance between x and the fittest protein sequences for the 3D structure. This 
problem is computationally hard (Theorem |lj 



P8 Given a protein sequence x, its native 3D structure, and two unweighted Hamming distances d\ 
and d>2, find a fittest protein sequence whose distance from x is also between d\ and d 2 . This 



problem is computationally hard (Theorem 18(1] 



P9 Given a protein sequence x, its native 3D structure, and an unweighted Hamming distance d, 
find the fittest among the protein sequences which are at distance d from x. This problem is 
computationally hard (Theorem |l8|(p|)). We have a polynomial-time approximation algorithm 
for this problem (Theorem |l3|) . 



P10 Given a protein sequence x and its native 3D structure, find all the ratios between the scaling 
factors a and /? in Equation || in Section |2| for the GC model such that the smallest possible 
unweighted (or weighted) Hamming distance between x and any fittest protein sequence is 
minimized over all possible a and j3. (This is a problem of tuning the GC model.) We have a 
polynomial-time algorithm for this problem (Theorem 16). 



Pll 



Given a 3D structure, determine whether the fittest protein sequences are connected, i.e., whether 
they can mutate into each other through allowable mutations, such as point mutations, while 

This problem 



the intermediate protein sequences all remain the fittest pLH. 17 



takes 0(A) time after a certain maximum network flow is computed (Theorem 11). 



P12 Given a 3D structure, in the case that the set of all fittest protein sequences is not connected, 
determine whether two given fittest protein sequences are connected. This problem takes 0(A) 



time after a certain maximum network flow is computed (Theorem 11). 



P13 Given a 3D structure, find the smallest set of allowable mutations with respect to which the 
fittest protein sequences (or two given fittest protein sequences) are connected. This problem 
takes O(A) time after a certain maximum network flow is computed (Theorem 11). 



Previously, Sun et. al. |£J developed a heuristic algorithm to search the space of protein se- 
quences for a fittest protein sequence without a guarantee of optimality or near-optimality. Hart 
|j~6[| subsequently raised the computational tractability of constructing a single fittest protein se- 
quence as an open question. Kleinberg jl^] gave the first polynomial-time algorithm for this problem, 
which is based on network flow. In contrast, Problem [Pi] asks for all fittest protein sequences and 
yet can be solved with the same time complexity. Kleinberg also formulated more general versions 
of Problems Pll and P12| by extending the fitness function to a submodular function and gave 



polynomial-time algorithms. Our formulations of these two problems and Problem P13 are directly 
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based on the fitness function of the GC model; furthermore, as is true with several other problems 



above, once a solution to Problem PI is obtained, we can solve these three problems in O(A) time. 
Among the above thirteen problems, those not yet mentioned in this comparison were not considered 
by Kleinberg. 

The remainder of this paper is organized as follows. Section |2] defines the GC model and states 
the basic computational assumptions. Section |3| describes our three basic tools based on linear 
programming, network flow, and an 0(A)-size representation of minimum cuts. Section || extends 
these tools to optimize multiple objectives, analyze the structures of the space of all fittest protein 
sequences, and generate near-fittest protein sequences. Section || gives some hardness results related 
to counting fittest protein sequences and finding fittest protein sequences under additional restrictions. 
Finally, Section ^ discusses our analysis of empirical 3D structures from the Protein Data Bank. 

2 The Grand Canonical Model and Computational Assumptions 

The Original Model Throughout this paper, all protein sequences are of n residues, unless ex- 
plicitly stated otherwise. The GC model is specified by a fitness function $ over all possible protein 
sequences x with respect to a given 3D structure of n residues [jl^,|33|]. In the model, to design a 
protein sequence x is to specify which residues are hydrophobic (H) and which ones are polar (P). 
Thus, we model x as a binary sequence x\, . . . ,x n or equivalently as a binary vector (x\, . . . ,x n ), 
where the i-th residue in x is H (respectively, P) if and only if X{ = 1 (respectively, 0). Then, $(x) 
is defined as follows, where the smaller &(x) is, the fitter x is, as the definition is motivated by the 
requirements that H residues in x (1) should have low solvent-accessible surface area and (2) should 
be close to one another in space to form a compact hydrophobic core. 

i,jeH(x),i<j-2 i&H{x) 

= a ^2 9(di,j)xiXj + P^SiXi, where (2) 



i<j-2 



H(x) = {i | Xi = 1}, 



the scaling parameters a < and (3 > have default values —2 and | respectively and may 
require tuning for specific applications (see Section |~~ 



• Si > is the area of the solvent-accessible contact surface for the residue (in A) flC, 11|, 

• dij > is the distance between the residues i and j (in A), and 

• g is a sigmoidal function, defined by 

g= i l+cxp(<L-6.5) When ^ 6 - 5 

[ when dij > 6.5. 

Extending the Model with Computational Assumptions Let opt(<I>) be the set of all pro- 
tein sequences x that minimize This paper is generally concerned with the structure of opt(<I>). 
Our computational problems assume that $ is given as input; in other words, the computations of 
a, f3, Si, g(dij) are not included in the problems. Also, for the sake of computational generality and 
notational simplicity, we assume that a may be any nonpositive number, (3 any nonnegative number, 
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Si any arbitrary number, and g(dij) any arbitrary nonnegative number; and that the terms g(dij) 
may range over 1 < i < j < n, unless explicitly stated otherwise. Thus, in the full generality of these 
assumptions, <I> need not correspond to an actual protein 3D structure. Note that the relaxation 
that Si is any number is technically useful for finding ^-minimizing protein sequences x that satisfy 
additional constraints. 

We write a^.,- = —a-g(dij) > and bi = (3-Si and further assume that the coefficients Oj j and 6, 
are rational with some common denominator, that these coefficients are expressed with a polynomial 
number of bits, and that arithmetic operations on these coefficients take constant time. 

With these assumptions, we define the following sets of specific assumptions about $ to be used 
at different places of this paper. 

Fl Let <3?(x) = —J2i<i<j<n a i,j x i x j + J2i<i<nbi%i, where Oij > 0, bi is arbitrary, and m of the 
coefficients cnj are nonzero. Let A = n + m. 

F2 For each (3 > 0, let $p(x) = -J2i<i<j<n a i,j x i x j +^Ei<i<n s i :c » where aij > 0, Si > 0, and m 
of the coefficients a% j are nonzero. Let A = n + m. 

F3 For each £ from 1 to /, let the £-th. fitness function $ (x) = —J2i<i<j<n a i,j x i x j +Si<i<n^f x *' 
where afj > and b\ is arbitrary. Let A = fn 2 . 

Sometimes we measure the dissimilarity between a fittest protein sequence x and a target protein 
sequence x in terms of Hamming distance. This distance is essentially the count of the positions i 
where x% ^ x% and can be measured in two ways. The unweighted Hamming distance is \x — x\, where 
\y\ denotes the noTTn of vector y, i.e., ^2i=i The weighted Hamming distance is ^2i=i ^i'\ x i — x i\- 
Throughout this paper, the weights w\,... ,w n are all arbitrary unless explicitly stated otherwise. 



3 Three Basic Tools 



This section describes our basic tools for computing fittest and near-fittest protein sequences. For 
instance, Lemma |l| gives a representation of the problem of minimizing $ as a linear program. 
Lemma § further gives a representation of this problem as a minimum-cut problem, which generalizes 
a similar representation of Kleinberg |l8|]. Theorem |5| gives a compact representation of the space 
opt( < I ) ) using a Picard-Queyranne graph [26|. 



3.1 Linear Programming 

From Equation ^, minimizing <&(x) is an optimization problem in quadratic programming. Fortu- 
nately, because all the coefficients a^j are nonnegative, it can be converted to a linear program, as 
shown in Lemma |l[ 



Lemma 1 (characterizing $ via linear program) Let $ be as defined in Assumption Fl. Con- 
sider the following linear program whose variables consist of the variables x%, together with new 
variables yij for all i,j with a^j ^ 0: 

minimize <J?'(x, y) = —J2 a i,jlli,j + J2 hxi 
subject to 

< xi < 1 V* 

0<Vij<l^ 
y%,j < x i } Vi, j : Oij / 

Vi,j — x j 
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There is a one-to-one correspondence that preserves x between the protein sequences that minimize 
<J>(x) and the basic optimal solutions to Linear Program (|3j). 



Proof: First, we show that for each 0-1 assignment to x there is a unique value of y that 
minimizes &(x,y). Choose some yij, and suppose that either xi or Xj is 0. Then y^j is also by 
the constraint yij < X{ or yij < xj. Alternatively, suppose Xi and xj are both 1; then if y^j is 0, $>' 
can be decreased by aij by setting yij to 1 without violating any constraints. Thus, in any optimal 
integral solution to Linear Program (|3|), y^j = min(xj 

Note that substituting XiXj for yij in $' gives precisely —J2i,j a i,j x i x j + J2ihxi = thus 
minimizing <&'(x,y) is equivalent to minimizing 

We now must show that all solutions to Linear Program @ are integral. Every element of the 
constraint matrix is either zero or ±1. Each row has either a single nonzero element (e.g, for the 0-1 
bounds) or consists of zeroes and exactly one +1 and one —1. Thus the matrix is totally unimodular, 
e.g., using pR Theorem 13.3]. Since the right-hand side is integral, any vertex of the polytope 



defined by Linear Program (y) is integral [25, Theorem 13.2]. Thus, all basic feasible solutions to 
Linear Program (0) are 0-1 vectors. 

So if (x,y) is a basic optimal solution to Linear Program (||), then x € opt( < I ) ). Conversely, if 
x € opt( < I ) ), then the vector (x, y) in which y^j = XiXj whenever aj j is nonzero is an optimal solution 
to Linear Program (||) , which is a basic optimal solution since an appropriate subset of the constraints 
< X{ < 1 and < yij < 1 form a basis. I 

Note that any Xi with a negative coefficient bi is set to 1 in any optimal solution, as in this case 
all terms containing Xi have negative coefficients and are minimized when Xi = 1. So an alternative 
to allowing negative coefficients is to prune out any Xj with a negative coefficient. This process must 
be repeated recursively, since setting Xi to 1 reduces terms of the form —aijXiXj to —aijXj, and may 
yield more degree-1 terms with negative coefficients. To simplify our discussion, we let the linear 
program (or, in Section ^L^, the minimum-cut algorithm) handle this pruning. 

3.2 Network Flow 

Recall that an s-t cut is a partition of the nodes of a digraph into two sets V s and Vt, with s (zV s and 
t € Vt- Also, a minimum s-t cut is an s-t cut with the smallest possible total capacity of all edges 
from nodes in V, to nodes in Vt. 



In Kleinberg's original construction [18], <£(x) was minimized by solving an s-t minimum cut 
problem in an appropriate digraph G. Lemma [2] describes a more general construction that includes 
additional edges (s,Vi) to handle negative values for bi. 

Lemma 2 (characterizing $ via network flow) Let $ be as defined in Assumption [Fj. Let 
be a graph with a source node s, a sink node t, a node Vi for each i, and a node Uij for each i,j with 
aij 7^ 0, for a total ofn + m + 2 = A + 2 nodes. Let the edge set of consist of 

• (s,u,ij) for each Uij, with capacity a i: j, 

• (vi,t) for each Vi with bi > 0, with capacity bi, 

• (s,Vi) for each Vi with bi < 0, with capacity —bi, and 

• (uij,Vi) and (uij,Vj), for each u%a, with infinite capacity, 
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for a total o/@(A) edges. 

There is a one-to-one correspondence between the minimum s-t cuts in G* and the protein se- 
quences in opt($), such that V{ is in the s- component of a cut if and only if Xi = 1 in the corresponding 
protein sequence. 

Proof: We will show that the minimum s-t cuts in G* correspond to ^-minimizing protein 
sequences via Linear Program (||) of Lemma Given a minimum s-t cut in G, let Xi be 1 if v j 
is in the s component, and otherwise. Similarly, let yij be 1 if Uij is in the s component, and 
otherwise. Since no infinite-capacity edge (uij,Vi) or (uij,Vj) can appear in the cut, if Uij is in 
the s-component then Vi and Vj are as well. In terms of the x and y variables, we have j/j < Xi 
and yij < Xj whenever ctij is nonzero, precisely the same constraints as in Linear Program @. 
Conversely, any 0-1 assignment (x, y) for which these constraints hold defines an s-t cut that does 
not include any infinite-capacity edge. 

Turning to the objective function, the total capacity of all edges in the cut is 

04 J (1 - Vij) + Y b i X i + - Xi) 

dij^O bi>0 bi<0 

= a hj ~Y bi ~Y a hjyi,3 + Y biXi = K + y)' 

i,j bi<0 i,j i 

where K is a constant and &(x, y) is the objective function of Linear Program (|3|). Thus, the capacity 
of the cut is minimized when <&'{x,y) is. The rest follows from Lemma [IJ, I 



Lemma 3 Let be as defined in Assumption Fl. Given $ as the input, we can find an x € opt( ( &) 
inO(A 2 logA) time. 

Proof: Given a digraph G = (V, E) as input, the Goldberg- Tarjan maximum-flow algorithm 
takes 0{\V\\E\\og{\V\ 2 /\E\)) time Q. We first apply Lemma | to $ to obtain G*. We next use 
this maximum-flow algorithm to find a minimum s-t cut in G* and then an optimal x from this cut. 
All these steps take 0(A 2 log A) total time. I 



3.3 A Compact Representation of Minimum Cuts 

A given $ may have more than one fittest protein sequence. Theorem || shows that opt(<3?) can be 
summarized compactly using the Picard-Queyranne representation of the set of all minimum s-t cuts 
in a digraph G [p6[], which is computed by the following steps: 

1. computing any maximum flow <j) in G; 

2. computing strongly connected components in the residual graph G^ whose edge set consists of 
all edges in G that are not saturated by 0, plus edges (v, u) for any edge (u, v) that has nonzero 
flow in <f>; 

3. contracting G^ by contracting into single supernodes the set of all nodes reachable from s, the 
set of all nodes that can reach t, and each strongly connected component in the remaining 
graph. 

The resulting graph G Sj t is a dag in which s and t are mapped to distinct supernodes by the 
contraction. Furthermore, there is a one-to-one correspondence between the minimum s-t cuts in G 
and the ideals in G S} t, where an ideal is any node set / with the property that any predecessor of a 
node in / is also in /. 
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Lemma 4 (see [26]) Given a digraph G with designated nodes s andt, there is a graph G s j together 



with a mapping k from V(G) to V{G Sj t) with the following properties: 

1. \V(G 9tt )\ < \V(G)\. 

2. The node k(s) has out-degree while n(t) has in-degree 0. 

3. Given G as the input, G s j and k can be computed using one maximum-flow computation and 
0(\E(G)\) additional work. 

4- A partition (V s ,Vt) of V(G) is an s-t minimum cut in G if and only if Vt = for some 

ideal I of G S) t that contains n(t) but not k(s). 

Combining Lemmas |2] and || gives the desired compact representation of the space of all fittest 
protein sequences, as stated in the next theorem. 

Theorem 5 (characterizing <I> via a dag) Let $ be as defined in Assumption [FI|. There exists a 
dag Gf t with designated nodes s' and t' and a mapping p from {1, . . . , n} to V(Gf t ) with the following 
properties: 

1. Gf t has at most n + 2 nodes. 

2. Given $ as the input, Gf t and p can be computed in 0(A 2 logA) time. 

3. There is a one-to-one correspondence between the protein sequences x S opt(3>) and the ideals of 
Gf t = Gft — {s', t'}, in which Xi = if and only if p(i) = t' or p(i) is in the ideal corresponding 
to x. 

Proof: The graph Gf t is obtained by applying Lemmas [2] and ||]. Let k be the contraction map 
from Lemma | Let s' = k(s) and t' = k{€). The mapping p(i) is defined as n(vi). 

To show that Gf t has at most n + 2 nodes, consider any node Ui in G*. Let <j> be the maximum 
flow used to define Gf t . If 4>(s,Uij) = 0, then mj is reachable from s in the residual graph Gf, and 
Uij is contracted onto the k(s) supernode; if <j)(s, Uij) > 0, then at least one of (j)(ui t j,Vi) or cp(ui t j,Vj) 
is nonzero and tijj is in the same strongly connected component in as at least one of V{ and vj. 
In either case Uij is contracted onto a supernode that contains s or some Vf, since the same thing 
happens to all Uij, there are at most n + 2 supernodes in Gf t : one for each plus one for each of 
s and t. 

Using ideals of Gf t is justified by the observation that requiring t' to be in an ideal and s' to be 
out of it has no effect on the presence or absence of other nodes, as t' has no predecessors and s' has 
no successors in Gf t ; thus there is a one-to-one correspondence preserving all nodes except s' and t' 
between the ideals of Gf t containing t' but not s' and the ideals of Gf t . I 

Remark. At some additional cost in time, Assumption |Fl] can be replaced in Theorem || by the 
weaker assumption that $ is submodular (i.e., that $(X U Y) + $(X n Y) < $(X) + ®{Y) for all 
X, Y, where each protein sequence x in <3?'s domain is regarded as the set X = {i \ x% = 1} for 
the purposes of taking unions and intersections). The reason is that a representation similar to the 
Picard-Queyranne graph exists for the set of minima of any such submodular function. These minima 
form a family closed under union and intersection, and any such family corresponds to the ideals of 
an appropriate digraph |[^, Proposition 10.3.3]. Such a representation can be computed efficiently, 
as shown by Gabow |12ft. 
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Intuitively, what Theorem || says is the following. For any the residues in fittest protein 
sequences are grouped into clusters, where the cluster p~ 1 (s) is always H, the cluster is always 

P, and for each of the remaining clusters, all residues in the cluster are either all H or all P. In 
addition, there is a dependence given by the edges of Gf t , such that if a cluster corresponding to the 
source of an edge is all H then the cluster at the other end is also all H. 

There is no additional restriction on the structure of the space of all fittest protein sequences be- 
yond those that follow from correspondence with the ideals of some digraph. As shown in Theorem 0, 
any graph may appear as Gf t , with any number of residues mapped to each supernode. 

Theorem 6 (characterizing a dag via <&) Let G be an arbitrary digraph with n nodes, labeled 1 to 
n, and m edges. Let Go be the component graph of G obtained by contracting each strongly connected 
component of G to a single supernode through a contraction map k. Then, there exists some <3? as 
defined in Assumption [FJ such that for the Gf t and p defined in Theorem |J, an isomorphism exists 
between Gf t and Go mapping each p(i) to k(i) . 

Proof: Represent each node i in G by the variable x,. For each i,j, let Cij = 1 if there is a 
directed edge in G, and otherwise. To define let aij = e^j + e^j for each i < j; and, for 
each i, let bi equal i's out-degree 5 + (i) = J2j e i,j- Apply Lemma ^ to the resulting function $ to get 
a graph G . Define a flow (f> in G* as follows: 

(f)(s,Uij) = eij + ejj Vwjj 
4>(uij,Vi) = e it j Wmj 

<t>{UiJ,Vj) = Bj t i VWjJ 

(j)(vi,t) = 8 + (i) Vvi 

Note that this flow is a maximum flow because it saturates all edges leaving s as well as all edges 
entering t. (It happens that this is the unique maximum flow, but we do not need this fact, as the 
Picard-Queyranne construction works for any maximum flow.) 

Our next goal is to show that the residual graph of this flow contracts to Go- has the 
following classes of edges: 



(uij,s) 






(Ui,j,Vi) 






(UiJ,Vj) 








when (i,j) 


G G 


(vj,u i:j ) 


when (j, i) 


€ G 


(Vi,t) 







Since s has no successors and t has no predecessors in G*, the supernodes in Gf t containing s and t 
consist of only s and t, respectively. Each node Uij is in the same strongly-connected component as at 
least one of Vi or Vj , so no other supernodes exist in Gf t that do not contain at least one of the nodes V{. 
Note that every node-simple path from V{ to Vj in is of the form Vi,Ui iqi ,v gi ,u qi:g2 , . . . ,u qk _ 1: j,Vj 
(with the subscripts of each Uq uqi+1 possibly reversed), which corresponds to a node-simple path 
i = qx, q2, ■ ■ ■ , qk = j in G. The converse also holds. Therefore, i and j are in the same strongly 
connected component in G if only if V{ and Vj are in the same strongly connected component in Gf t . 
Now recall that for each i, p(i) is defined in Lemma ||| as the supernode in Gf t containing Uj. So an 
edge from p(i) to p{j) in Gf t corresponds to a node-simple path from V{ to Vj Gf t . By the above 
path-to-path correspondence, every directed edge from p{i) to p(j) in Gf t corresponds to an edge 
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from k{i) to and vice versa. In summary, Gf t is isomorphic to Go, with the correspondence 

p(i) «-> k(z) for all i. I 

4 Further Tools for Protein Landscape Analysis 
4.1 Optimizing Multiple Objectives 

We can extend the results of Section || beyond optimizing a single fitness function. 

With more than one fittest protein sequence to choose from, we may wish to find a fittest protein 
sequence x that is the closet to some target protein sequence x in unweighted or weighted Hamming 
distance. Theorem ^ shows that this optimization problem is as easy as finding an arbitrary fittest 
protein sequence. 

We may also wish to consider what protein sequences are simultaneously the fittest for more than 
one fitness function. Theorem ||| shows how to compute a representation of this set similar to that 
provided by Theorem 

Theorem 7 (optimizing Hamming distances and ii/-residue counts over opt(<&)) Let <I> be 



as defined in Assumption Fl. 



1. Given a target protein sequence x, some weights Wi, and <3? as the input, we can find in 
0(A 2 log A) time an x € opt(<I>) with the minimum weighted Hamming distance J2 

over opt(<I>). 

2. Given as the input, we can find in 0(A 2 logA) time an x S opt($) with the largest (or 
smallest) possible number of H residues over opt( < I ) ). 

Proof: The statements are proved as follows. 

Statement |]. Let e be a positive constant at most 4 ^ nc , where W > max \ wi\ and c is the common 
denominator of all coefficients Ojj and 6j. Below we show how to find a desired fittest protein sequence 
by minimizing <3? e (x) = <I>(x) + J2i ew i \ x i ~ 

First of all, since x and x are 0-1 sequences, \x{ — x«| = (xj — x^) 2 = Xj — 2fjXj + Xj. Then, since 
x is given, $ e (x) can be minimized using Lemma || in 0(A 2 log A) time. 

Now suppose that y and z are two protein sequences with y G opt( ( l > ) and z opt(<I > ). Then, 
*(z) - $(y) > ~. Also, \ZiWi\xi-XiW < Wn < ^. Therefore, * 6 («) - $ e (y) > \ - 2Wn > ±. 
Thus every x that minimizes $ e (x) must also minimize $(x). The Hamming distance term in <£ e 
guarantees that from all x that do minimize $(x), minimizing <& t (x) selects one that also minimizes 
this distance. 

Statement ||. To find an x G opt(<I > ) with the largest (respectively, smallest) possible number of 
H residues, apply Statement |l] with all Wi = 1 and all Xj = 1 (respectively, x» = 0). I 

Suppose we are given fitness functions $ 1 , . . . , corresponding to multiple 3D structures, and 
we wish to find protein sequences that are simultaneously optimal for each 3D structure. A simple 
approach is to observe that the function <£* = < I> 1 + • ■ ■ + <3?^ satisfies Assumption Fl, and that any 



protein sequence that simultaneously optimizes each $ optimizes <&*. However, we must check any 
minimum solution for <I>* to see that it is in fact a minimum solution for each as it may be that 
the sets of minimum solutions of the have empty intersection. Performing both the optimization 
of and of the / individual fitness functions requires / + 1 network flow computations. 

It turns out that we can reduce this cost to / network flows at the cost of some additional work 
to compute a composite Picard-Queyranne graph Gf t directly from the individual graphs Gf t . This 
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approach, described in Theorem ||, is especially useful if we have already computed the individual 
graphs for some other purpose. 

Theorem 8 (minimizing multiple fitness functions) Let . . . , be as defined in Assump- 
tion [F3( . For each I, let Gf t and p e be the dag and map computed from <3?^ in Theorem [J. Given 
all Gf t and p as the input, there is an 0(A) -time algorithm that either (a) determines that there 
is no protein sequence x that simultaneously minimizes through <$>■> , or (b) constructs a dag Gf t 
with designated nodes s' and t' and a mapping p* from {1, . . . ,n} to V(Gf t ), such that there is a 
one-to-one correspondence between the protein sequences x that simultaneously minimize all &(x) 
and the ideals of Gf* t = Gf t —{s',t'}, in which Xi = if and only if p*(i) = t' or p*(i) is in the ideal 
corresponding to x. 

Proof: By Theorem ||, the conditions below are necessary and sufficient for x to minimize <& . 

• Xi = 1 if p (i) = s'. 

• Xi = if p e (i) = t'. 

• Xi = xj if p e (i) = p e (j). 

• Xi < xj if (p e (i) , p £ (j)) is an edge in Gf e t . 

We will build a graph G whose nodes are s, t, and 1, . . . ,n, and put in an edge (u, v) between 
any nodes for which the constraint u < v is required to minimize some $ . In particular, we have the 
following classes of edges, for each £, where each class represents one of the above conditions: 

• (s, v) and (v, s) whenever p^(v) = s' . 

• (t, v) and (v,t) whenever p e (v) = t'. 

• (u,v) and (v , u) whenever p e (u) = p (v). 

• (u,v) whenever (p e (u),p e (v) E E(Gf t )). 

An assignment of 1 to s, to t, and Xi to each node i satisfies u < v whenever (u,v) £ E(G) if 
and only if all of the constraints required for x to simultaneously minimize all $ are satisfied. Note 
that such an assignment might not exist. 

To convert G into the desired graph Gf t , and to check whether there exist any assignments 
meeting the constraints, contract each strongly connected component of G. If s and t are in the same 
strongly connected component, no simultaneous fittest solutions exist. Otherwise, let s' in Gf t be 
the supernode that contains s from G; let t' be the supernode that contains t. Also, for each u in 
1, . . . , n, let p*(u) be the supernode into which u is contracted. Then x simultaneously minimizes all 
& e if and only if Xj = 1 when p*(i) = s', x% = when p*(i) = t', and X{ < xj when (p* (i) , p* (j)) is an 
edge in Gf t — precisely the condition that the zeroes in x correspond to nodes in some ideal of Gf * t 
that contains t' but not s'. 

To show the running time, observe that constructing the graph G takes O(A) time, which domi- 
nates the contraction step. I 
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4.2 The Space of All Fittest Protein Sequences 

This section discusses some applications of the representation of the space opt( < I ) ) given by Theorem & 
Theorem |9| gives an algorithm to enumerate this space. Theorem 10 gives an algorithm to compute the 
diameter of the space in nonnegatively weighted Hamming distance. Theorem [ll] gives an algorithm 
to determine connectivity properties of the space with respect to various classes of mutations. 



Theorem 9 (enumerating all protein sequences) Let $ be as defined in Assumption Fl. Given 
the Gf t and p defined in Theorem Q as the input, the protein sequences in opt(<I>) can be enumerated 
in 0{n) time per protein sequence. 

Proof: An algorithm of Steiner [32 enumerates the ideals of in time 0(\V(Gfj\) = 0(n) 
per ideal. For each ideal, invert the mapping p (in O(n) time) to recover the corresponding protein 
sequence x. I 



Theorem 10 (computing the diameter) Let $ be as defined in Assumption Fl. Given the Gf t 
and p defined in Theorem [3| as the input, it takes 0(n) time to compute the diameter of opt(<3?) in 
weighted Hamming distance where the weights W{ are all nonnegative. 

Proof: Any two fittest protein sequences x and y can differ only at indices i where p(i) £ {s, t}. 
Let d be the total weight of indices i £ p~ 1 (V(Gf t )). Then, d is an upper bound on the diameter. 
It is also a lower bound, as and V(Gf t ) are both ideals of Gf t , and these ideals correspond to two 
protein sequences at distance d from each other. I 

We can use Gf t to determine whether opt($) is connected for various models of mutations. For 
instance, we can determine whether the space is connected for one-point mutations, in which at most 
one residue changes with each mutation and all intermediate protein sequences must remain the 
fittest. More generally, we can determine the minimum k so that the space is connected where each 
mutation modifies at most k residues. 



We adopt a general model proposed by Kleinberg [18|. In the model, there is a system A of subsets 
of {1, . . . , n} that is closed downward, i.e., if A C B € A, then A G A. Two protein sequences x and y 
are A-adjacent if they are in opt( < I ) ) and differ exactly at the positions indexed by elements of some 
member of A. A A-chain is a sequence of protein sequences in opt(<I>) where each adjacent pair is 
A-adjacent. Two protein sequences x and y are A-connected if there exists a A-chain between x and 
y. A set of protein sequences is A-connected if every pair of elements of the set are A-connected. We 
would like to tell for any given A and whether particular protein sequences are A-connected and 
whether the entire opt(<I>) is A-connected. 

Kleinberg [|l8| gives polynomial-time algorithms for these problems that take A as input (via oracle 
calls) and depend only on the fact that $ is submodular. We describe a much simpler algorithm 
that uses Gf t from Theorem |f| This algorithm not only determines whether two protein sequences 
(alternatively, all protein sequences in opt(<I ) )) are connected for any given A, but also determines 
the unique minimum A for which the desired connectivity holds. Almost all of the work is done in 
the computation of Gf t ; once we have this representation, we can read off the connectivity of opt(<E>) 
directly. 



Theorem 11 (connectivity via mutations) Let<& be as defined in Assumption Fl. The following 
problems can both be solved in 0{n) time. 
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1. Given the Gf t and p defined in Theorem^ and two protein sequences x and x' in opt(<I>) as the 
input, compute the maximal elements of the smallest downward-closed set system A such that 
x and x 1 are A-connected. 

2. Given the G® t and p defined in Theorem as the input, compute the maximal elements of the 
smallest downward-closed set system A such that opt(3>) is A-connected. 

Proof: The statements are proved as follows. 

Statement 1. Let 1,1' be the ideals in Gf t such that the sets of zeros in x,x' are p~ l (I'), 
respectively. Let the maximal elements of A be the sets p^ 1 ^) over all v £ I © I', where is the 
symmetric difference operator. Thus, A consists of these sets and all of their subsets. We will show 
that A is the smallest downward-closed set system such that there is a A-chain between x and x' in 
opt($). 

First, consider some set system A' where for some v £ / I', A = p~ l (v) is not in A'. Recall 
that x must be constant at the positions indexed by elements of A, where the constant depends on 
whether or not v is in the ideal in Gf t corresponding to x. Partition opt($) into sets fi° and O 1 , 
where consists of all z with z% = j for all i € A. Since x and x' differ on /9 _1 (w), one of them is in 
f2° and the other is in fi 1 . However, since A £ A', no protein sequence in Q° is A'-adjacent to one in 
Q . So there is no A'-chain between x and x'. 

Conversely, to exhibit a A-chain from x to x', it suffices to show by iterations a A-chain from x to 
the protein sequence y whose zeroes are given by / o _1 (/nI / ); the case of x' is symmetric. Let Iq = I. 
If at any iteration Jj = I PI I', we are done. Otherwise, let v be a maximal element in Jj — (I n I'). 
Then, Jj + i = Jj — {v} is also an ideal. Since v G I I', p (v) £ A and the protein sequences 
corresponding to ij and ij+i are A-adjacent. After at most n such iterations, we reach y. 

Statement 2. Let x and x' be the protein sequences in opt(<3?) with the largest and the smallest 
possible numbers of H residues, respectively. In other words, x and x' correspond to Gf t and its 
empty ideal, respectively. If A includes /9 -1 (i>) for all v G V(Gf t ), then by Statement 1, there are 
A-chains between any y € opt( < I ) ) and x' and thus between any two protein sequences in opt(<I>). If 
it does not, then there is no A-chain between x and x' . In summary, the maximal elements of A are 
the sets p~ l {v) over all v £ V(Gf t ). I 



4.3 Generating Near-Fittest Protein Sequences 

Finding good protein sequences other than the fittest is trickier, as Lemma [l] breaks down if we are 
not looking at the fittest protein sequences. This section gives two algorithms that avoid this problem. 
Theorem 12] describes an algorithm to generate all protein sequences x in order of increasing Q(x). 
Theorem describes an algorithm to generate the fittest protein sequences at different unweighted 
Hamming distances, which is useful for examining the trade-off between fitness and distance. 

The algorithm for generating all protein sequences x in increasing order by is based on 

Lemma [3| and a general technique for enumerating suboptimal solutions to combinatorial optimization 



problems due to Lawler [21|. It is similar to an algorithm of Vazirani and Yannakakis 34 j for 
enumerating suboptimal cuts. We cannot use the Vazirani- Yannakakis algorithm directly because 
suboptimal cuts in G* might include cuts corresponding to assignments in which yi j is not equal to 
XiXj for some 

Theorem 12 (enumerating all protein sequences) Let be as defined in Assumption [Fj|. With 
<I> as the input, we can enumerate all protein sequences x in order of increasing $>(x) in time 
0(nA 2 logA) per protein sequence. 
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Proof: For any length-/c 0-1 sequence y = y\, 2/2, • • • , Uk, let A y be the set of all length-n 0-1 
sequences x with Xj = yi for each yi with 1 < i < k. Let e be the empty sequence. Then, A £ is the set 
of all length-n sequences. Observe that we can find an element z € A y that minimizes &(z) over A y 
in time 0(A 2 log A) by setting Zi = yi in Q(z) for each t/j and applying Lemma ||. Furthermore, the 
set A y — {z} is the disjoint union of the sets A r . for k + 1 < i < n, where r^ = Z\, Z2, ■ ■ ■ , Zi—i, (1 — Z{). 

To enumerate x in order of increasing &(x), we maintain a data structure that represents all 
protein sequences less those already returned as a disjoint union of sets of the form A y , together with 
an ^-minimizing element z for each, organized as a priority queue with key &(z). Initially, the queue 
contains only (A £ ,z), where z £ opt(<I>) is computed using Lemma ^ in time 0(A 2 log A). At each 
step, the smallest pair (A y ,z) is removed from the priority queue and is replaced by up to n pairs 
(A r . ,Pi), where pi is an ^-minimizing element of A r . ; z is then returned. Each such step requires no 
more than n applications of Lemma |3|, and the cost of the at most n + 1 priority queue operations is 
at most 0(n log(2 n )) = 0(n 2 ), giving a total cost of <3(nA 2 logA) per value returned. I 

Let x be a target protein sequence. For d £ {0, . . . , n}, let F(d) be the smallest over all 

protein sequences x at unweighted Hamming distance d from x. A basic task of landscape analysis is 
to plot the graph of F. As Theorem |l^(||) in Section || shows, this task is computationally difficult 
in general. Therefore, one way to plot the graph of F would be to use Theorem 12 to enumerate 
all protein sequences x in order of increasing <£(x) until for each d, at least one protein sequence at 
distance d from x has been enumerated. This solution may require processing exponentially many 



protein sequences before F is fully plotted. As an alternative, Theorem |13| gives a tool for plotting 
F approximately in polynomial time. 

Theorem 13 (approximately plotting the energy-distance landscape) Let be as defined 
in Assumption |F1|. For each e, let Q e ( x ) = $(x) + f--\x — x\. Let ^(e) be the minimum Q e (x) 
over all x. 

1. & is a continuous piecewise linear concave function defined on R with at most n + 1 segments 
and thus at most n + 1 corners. 

2. Let (ei, $(ei)), . . . , (efc, $(efc)) be the corners of where e\ < ■■■ < Let di be the slope of 
the segment immediately to the right of e^. Let do be the slope of the segment immediately to 
the left of e\. Then, n = do > d\ > ■ ■ ■ > df. = 0. 

3. Let d G {0, 1, . . . ,n}. 

(a) F(d ) = ¥(ei) - e v d Q . F(d k ) = ¥(e fc ) - e k -d k . For < i < k, F(di) = ¥(e;) - e v di = 
$(e i+ i) - e i+r di. 

(b) For di> d> d i+1 with 0<i<k, F(d) > XF(di) + (1 - X)F(d i+ i), where A 



di—d. 



i+l 



4- Given $ and x as the input, we can compute (e%, $(ei)), . . . , (e&, 3?(efc)) and do,...,d k in 
0(nA 2 log A) time. 

Proof: The statements are proved as follows. 

Statement ffl. The concavity follows from the minimality of <£(e) and the fact that for any fixed 
x, &e(%) is linear in e with slope \x — x\. Then, the continuous piecewise linearity and the counts of 
segments and corners follow from the fact that \x — x\ € {0, 1, . . . , n}. 

Statement By the concavity of <£, do > d\ > • ■ • > d k - Let W = 1 + J2i<i<j< n a i,j + J27=i \ s i\- 
For all e < — W, 3 ) e (x) is minimized if and only if x is at distance n from x. Similarly, for all e > W, 
<£ e (x) is minimized if and only if x is at distance from x. Therefore, do = n and d k = 0. 
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Statement ||. Case |a| is straightforward. To prove Case let x be a protein sequence that has the 
smallest over all protein sequences at distance d from x. Then, F(d) = 3>(x), and & 6i+1 (x) = 
F(d) + ej-fi-d. On the other hand, by the minimality of $ ej+1 (a;) > $(ei+i). Furthermore, by 
Case ||, $(ei + i) = + e i+v d i+ i = F(di) + e i+ ydi. Thus, 

F{d i+ i) < F(d) + e i+r d - e i+v d i+ r, 
F(di) < F(d) + € i+v d-e i+r d t . 

Case [3b| follows from algebra and these two inequalities. 

Statement ^. For given e and x, let line(e,a;) be the line through the point (e,<3? e (x)) and with 
slope \x — x\. Let L e (respectively, R e ) be the protein sequence x such that \x — x\ is the largest 
(respectively, smallest) possible over opt(<3? £ ). Note that L e , R e , line(e,L e ), and line(e, R e ) can be 
computed in 0(A 2 log A) total time using Theorem 0. Furthermore, line(e, L e ) and line(e, R e ) contain 
the segments of <& immediately to the left and the right of e, respectively. Consequently, e is a corner 
of <3? if and only if line(e, L e ) ^ line(e, R e ). 

To compute the corners and slopes of we first describe a recursive corner-slope finding sub- 
routine as follows. The subroutine takes as input an interval [c',e"] where e' < e" together with 
lme(e', R e >) and line(e", L e n). It outputs all the corners (e, 3>(e)) of <I> together with slopes \L e — x\ 
and \R € — x \ where e' < e < e" . There are two cases. 

Case 1: line(e', R e >) = hne(e", L e r>). Then, there is no corner over the interval (e',e"), and thus 
the subroutine call ends without reporting any new corner or slope. 

Case 2: line(e', R € i) ^ line(e", L e »). Then, compute e'" at which line(e', R e >) and line(e", L e «) 
intersect; by the concavity of $ stated in Statement [l], e' < e'" < e". Also, compute line(e'", L € m) and 
line(e'", R e "')- There are two subcases: 

Case 2a: line(e /// , L e m) ^ line (e w , R e m). Then the subroutine returns (e w , $(e w )) as a new corner 
together with slopes \L € m — x\ and \R e m — x\ and recurses on the intervals [e',e"'] and [e"',e"]. 

Case 2b: line(e w , L e m) = line(e"', The subroutine returns no new corner or slope but 

recurses on the intervals [e', e'"] and [e w , e"]. In this case, the subroutine has found the line containing 
a new segment of $, i.e., the segment through the point (e /// ,$(e /// )). 

This completes the description of the subroutine. The running time of this subroutine is dominated 
by that for computing L e /n and R e /n and thus is 0(A 2 log A). 

With this subroutine, we can find the corners and slopes of $ as follows. Recall W from the 
proof of Statement |2[ Note that if e < — W or e > W, then $ has no corner at e. So we compute 
line(— 2W, R-2\v) and line(2VF, L<2w) and apply the subroutine to the interval [— 2W, 2W] to find all 
the corners and slopes of <!>. This algorithm makes 0(n) recursive calls to the subroutine since by 
Statement SI, there are only 0(n) corners and segments, and each recursive call finds at least one 
new corner or segment. The running time of the algorithm is dominated by the total running time 
of these calls and thus is 0(nA 2 logA) as stated in the statement. I 



4.4 Tuning the Parameters of the GC Model 

This section shows how to systematically tune the parameters a and (5 so that a fittest protein 
sequence for a given 3D structure matches the 3D structure's native protein sequence as closely as 
possible in terms of unweighted or weighted Hamming distance. For this purpose, we assume Si > 0. 
Furthermore, since the fitness function does not have an absolute scale, we may fix a at —1 and vary 
(3. In summary, this section adopts Assumption F2. 

Let 11$ be the set of indices i with Si ^ 0. The next lemma shows that for any fittest protein 
sequence x of $,g, the set H (x) n 11$ of H residues with nonzero surface area Sj is monotone in (3. 
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Then, as shown in Theorem 16, to tune (5, we only need to consider at most n + 1 possible values of 



Lemma 14 Let <3? be as defined in Assumption Let y and z be any fittest protein sequences for 
<§>p x and &p , respectively. If Pi < (3q, then H(y) n 11$ D H(z) PI 11$. 

Proof: Let H\ = H(y) and Hq = H(z). Let H\q = Hi — Hq, i.e., the set of indices i for which 
Xi changes from 1 to when (5 changes from (5\ to (3q. Similarly, let Hqi = Hq — H\. Further, let 
#io = Ej G tf 10 Si and B 01 = J2i^H 01 s i- 

Let A = T,i,jeH a i,j ~ £i,jeffi a i,j- Then ' ~ ^piv) = ~A + /3(#oi - #io)- Let A m = 

^2ij<=H a {ij}n_ff O i^0 a *J' = — (-^ ~~ -4oi)) which is the sum of the terms ajj that $p loses 

when Xj changes from 1 to 0. Similarly, Aqi is the sum of the terms aij that §p gains when Xj 
changes from 1 to 0. 

To show H(y) n 11$ D (z) n 11$, we need to prove Hqi n 11$ = or equivalently Bqi = 0. To do 
so by contradiction, suppose Boi > 0. There are two cases: 

Case 1: — «4oi + Po^oi > 0. Notice that the protein sequence z' with H(z') = Hq — Hqi has a 
smaller fitness value for (3q than z does, contradicting the minimality of z. 

Case 2: — Aq\ + /?o#oi < 0. Then, —Aqi + /?i£>oi < 0. Therefor, the protein sequence y' with 
H(y') = H\ U Hqi has a smaller fitness value for (3\ than y does, contradicting the minimality of y. I 

Let <&(/?) be the minimum $g(x) over all x. The next lemma characterizes the structure of <&(/?). 



This structure is then used to tune (3 in Theorem 16 



Lemma 15 Let <I> be as defined in Assumption F2 



1. & is a continuous piecewise linear concave function defined on [0, oo) with at most n+1 segments 
and thus at most n + 1 corners. 

2. For all /?i < /?3 < /?4 < 02 where (/3i, <&(/3i)) and (02, $(02)) are adjacent corners o/ we 
have opt($ / g 3 ) = opt($ / g 4 ) ; opt^/^) C opt( < I ) /3 1 ), and opt^/^) C opt(&p 2 ). Similarly, for all 
/?i < /?3 < /?4 where 3>(/5i)) is i/ie rightmost corner, we have opt($ / g 3 ) = opt( ( I> / 3 4 ) and 

Opt($/3 3 ) C Opt(* A ). 

3. Given <3? as t/ie input, it takes 0(nA 2 log A) time to find the set of all such that (0, $(0)) is 
a corner of . 



Proof: The statements are proved as follows. 

Statement ffl. The concavity follows from the minimality of $(0) and the fact that for any fixed 
x, $p(x) is linear in with slope J2i£H(x)nu^, s i- Then, the continuous piecewise linearity and the 
counts of segments and corners follow from Lemma O . 

Statement & The proofs for the case that (0i,$(0i)) is the rightmost corner and the comple- 
mentary case are similar. So we only detail the proof for the former. Note that (03, $(03)) is not a 
corner. So for every fixed x G opt($ / g 3 ), the line $p(x) goes through the corner (/3i, and the 

point (/?4, 3>(/?4)). Thus, x G opt( < I ) / g 1 ) and x G opt($g 4 ). By symmetry, for every y G opt( < I ) / g 4 ), we 
have y G opt($ / g 3 ). In summary, opt($ / g 3 ) = opt($ / g 4 ) and opt($p 3 ) C opt($ / g 1 ). 

Statement ||[ For any given and x, let line(/3, x) be the line through the point (0,$p(x)) 
and with slope SiG-ff(x)nn $ s i- For each /3, let Lp (respectively, Rp) be the protein sequence x 
such that H(x) has the largest (respectively, smallest) possible cardinality over opt(<l>g). Note that 
Lp, Rp, \ine(0,Lp), and lme(0, Rp) can be computed in 0(A 2 logA) total time using Theorem [7|. 
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Furthermore, line(/3, Lp) and line(/3, Rp) contain the segments of immediately to the left and the 
right of (3, respectively. Consequently, for (3 > 0, (3 is a corner of $ if and only if line(/3, Lp) ^ 
line(/3, Rp). Also, (0, $(0)) is the leftmost corner, and the segment of $ to the right of is contained 
by line(0 5J Ro). 

To compute the corners of <3?, we first describe a recursive corner- finding subroutine as follows. 
The subroutine takes as input an interval [/3i,/?2] where (3\ < (3% together with line(/3i, Rfc) and 
line(/?2, Lp 2 ). It outputs all the corners (/?,$(/?)) of <I> with f3\ < {3 < (32- There are two cases. 

Case 1: line(/3i, i?^) = line^j £/3 a )- Then, there is no corner over the interval (Pi, ^2), and thus 
the subroutine call ends without reporting any new corner. 

Case 2: line(/3i, Rp^ 7^ line^, Lp 2 ). Then, compute P% at which ]ine(Pi, Rp x ) and line(P2, Lp 2 ) 
intersect; by the concavity of <I> stated in Lemma 15(|), Pi < Pz < Pi- Also, compute ]ine(Ps, Lp 3 ) 



and line(/?3, Rp 3 )- There are two subcases: 

Case 2a: line(/?3, Lg 3 ) 7^ line(^3, Rp 3 ). Then the subroutine returns (P3, $>(Ps)) as a new corner 
and recur ses on the intervals [Pi,Pz[ and [Pz,P%\- 

Case 2b: \hxe(P^,Lp 3 ) = rine(/?3, Rp 3 )- The subroutine returns no new corner but recurses on 
the intervals [Pi,Ps] and [Pz,p2\. In this case, the subroutine has found the line containing a new 
segment of <£, i.e., the segment through the point (Pz,^(Pz))- 

This completes the description of the subroutine. The running time of this subroutine is dominated 
by that for computing Lp 3 and Rp 3 and thus is 0(A 2 logA). 

With this subroutine, we can find the corners of $ as follows. If every Sj = 0, then (0, ^(0)) is 
the only corner. Otherwise, let Poo be 1 + J2i<i<j<n a i,j divided by the smallest nonzero Sj. Note 
that for every P > P^, $ has no corner at P, Then, we compute line(0, Lq) and line(0, Rp^)- We 
report the leftmost corner (0, $(0)) and apply the subroutine to the interval [0, P^} to find all the 
other corners of <!>. 



This algorithm makes 0(n) recursive calls to the subroutine since by Lemma 15, there are only 
0(n) corners and segments, and each recursive call finds at least one new corner or segment. The 
running time of the algorithm is dominated by the total running time of these calls and thus is 
0(nA 2 logA) as stated in the lemma. I 

Theorem 16 (tuning a and P) Let be as defined in Assumption Given a target protein 
sequence x and $ as the input, we can find in 0(nA 2 logA) time the set of all P where the closest 
unweighted (or weighted) Hamming distance between x and any protein sequence in opt(<&p) is the 
minimum over all possible p. 

Proof: The proofs for the cases of unweighted and weighted Hamming distances are similar. So 
we only detail the proof for the unweighted case. Our algorithm for finding all distance-minimizing 
choices of P has three stages. 

Stage 1. Use Lemma |l5|(|3|) to find all the corners of <I>. 

Stage 2. For each corner (/?, $(/?)), use Theorem [7| to compute the closest Hamming distance dp 
between x and any protein sequence in opt(&p). Let d m \ n be the smallest dp over all corners. Then, 
by Lemma 15 (2|), report all P with dp = d m j n as desired choices of distance- minimizing p. 



Stage 3. Consider each segment of <£. Let Pi and P2 be the vertical coordinates of the left and 
right endpoints of the segment. Find a suitable P3 in the open interval (Pi,P2) as follows. If P2 
is finite, then set ^3 = (Pi + /?2)/2; otherwise, set Pz = Pi + 1- Then use Theorem |7| to compute 
the closest unweighted or weighted Hamming distance dp 3 between x and any protein sequence in 



opt ($p 3 )- If dp 3 = d m i n , then by Lemma 15 (2|), report that every P in the interval (Pi, P2) is a desired 
distance-minimizing p. 
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This completes the description of the algorithm. By Lemma 15|(3|), Stage 1 takes 0(nA log A) 



time. By Theorem ^ and Lemma ^ (|l]), Stages 2 and 3 also take 0(nA 2 log A) time. Thus, the total 
running time is as stated in the theorem. I 



5 Computational Hardness Results 



Theorem 17 (hardness of counting and averaging) Let <3? be as defined in Assumption Fl. 
The following problems are all #P-complete: 

1. Given <3? as the input, compute the cardinality o/opt(<I>). 

2. Given . . . , <E>^ as the input, where f is any fixed positive integer and $ 1 , . . . , $>f are as defined 



in Assumption Fc, compute the number of protein sequences x that simultaneously minimize 



&{x) for all£ = l,...,f. 

3. Given $ as the input, compute the average norm \x\, i.e., the average number of H residues in 
x, over all x € opt(<I>). 

4- Given $ and a target protein sequence x as the input, compute the average unweighted Hamming 
distance \x — x\ over all x G opt(<I>). 

5. Given <£, a target protein sequence x, and an integer d as the input, compute the number of 
protein sequences in opt( ( I ) ) at unweighted Hamming distance d from x. 

Proof: Note that each of the problems is in #P, because we can recognize an element of opt( < I ) ) 
in polynomial time using Lemma ||. So to prove #P-completeness we must only show that each 
problem is #P-hard. 

Statement |]. Reduce from the problem of counting the number of ideals in a dag, which is #P- 
hard p8|| . Given a dag G, apply Theorem ^ to get a function <3? for which Gf t is isomorphic to G. 
By Theorem |, counting opt( ( I ) ) is then equivalent to counting the number of ideals of Gf t = G. 

Statement % The problem in Statement [l] is a special case of the problem in this statement. 

Statement 3|. Using the same construction as in Statement |], we can reduce from the problem 
of computing the average cardinality of an ideal in G. To see that this latter problem is #P-hard, 
suppose that we can compute the average cardinalities of ideals in an n-node G and in an augmented 
graph G' obtained from G by adding a single new node s and edges from every v £ V(G) to s. Let c 
be the average for G and c' the average for G' . Let iV be the number of ideals in G. Then c = K/N 
for some K, while c' = (K + n + l)/(A r + 1), since the only new ideal in G' consists of s and all other 
nodes, and thus has size n + 1. Solving for N gives N = (n + 1 — c')/(c' — c), which can be computed 
from c, c', and n. 

Statement I. This problem has the problem in Statement |3| as a special case with all £i = 0. 

Statement 5. To reduce the problem of counting protein sequences in opt( < I ) ) to counting protein 
sequences at a given unweighted Hamming distance, take the dag Gf t given by Theorem ||, and add 
to each node Vi a node qi with edges (vi,qi) and (qi,Vi). Apply Theorem^ to this new graph G' to 
obtain a function <£' for which opt(<I> / ) is in one-to-one correspondence with the set of ideals of the 
strongly-connected component graph G' of G' . Each strongly connected component of G' consists 
of Vi and qi for some i, so & is isomorphic to Gf t . Now we choose x with x Vi = and x qi = 1. 
Then, the contribution to \x — x\ of each pair x Vi ,x qi is 1 regardless of their common value. Thus, the 
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number of ^-minimizing protein sequences equals the number of ^'-minimizing protein sequences at 
distance d from x, where d is the number of nodes in Gf t . I 

Theorem 18 (hardness of plotting the energy-distance landscape) Let <3? be as defined in 
Assumption \F1\ . 

1. Given <3? and two integers d\,d 2 as the input, it is NP-complete to determine whether there is 
an ^-minimizing x with d\ < \x\ < d 2 - 

2. Let x be a target protein sequence. For d € {0, . . . ,n}, let F(d) be the smallest <l>(x) over all 
protein sequences x at unweighted Hamming distance d from x. Given <I> and d as the input, it 
is NP-hard to compute F(d). 

Proof: Statement [2] follows from the fact that the problem in Statement Q can be reduced to 
the problem in this statement in polynomial time. Statement |l] is proved as follows. 

Since we can recognize ^-minimizing protein sequences using Lemma ||, the problem is clearly in 
NP. To show that it is NP-hard, we reduce from PARTIALLY ORDERED KNAPSACK, problem 
MP 12 from Garey and Johnson g|, pp. 247-248]. 

The input to PARTIALLY ORDERED KNAPSACK consists of a partially-ordered set U, each 
element u of which is assigned a size s(u) 6 Z + and a value v(u) £ Z + , together with a upper bound 
B\ on total size and a lower bound B 2 on total value. The problem is to determine whether there 
exists an ideal I in U such that J2 U £i s ( u ) — &i an d J2uei v ( u ) — B 2 - Garey and Johnson note 
that the problem, even with s(u) = v(u) for all u E U, is NP-complete in the strong sense (meaning 
that there is some polynomial bound on the size of all numbers in the input with which it remains 
NP-complete) . 

Given an instance of PARTIALLY ORDERED KNAPSACK with s(u) = v(u) for all u and all 
numbers bounded by some polynomial p, build a graph G where each u G U is represented by a 
clique C(u) of s(u) nodes, and there is an edge from C(u) to C(u') if and only if u ~< u' . Note 
that because s(u) < p, G has polynomial size. Apply Theorem || to generate a function $ (in 
polynomial time) such that Gf t is isomorphic to the component graph Go obtained by contracting 
all strongly connected components of G. Since the strongly connected components of G are precisely 
the cliques C(u), Gf t = Go is isomorphic to U, interpreted as a dag. In particular any ideal of Gf t 
corresponds to an ideal / of U. Let iV = J2ueu s ( u )- The norm of the corresponding vector \x\ is 
N - Eng/ \C(u)\ = N - Y,u&i s(u) = N - Eueiv(u). Set d x = N - B u d 2 = N - B 2 , and we have 
the problem stated in the theorem. I 



6 Applications to Empirical Protein 3D Structures 

To demonstrate our algorithms, we chose 34 proteins with known 3D structures from the Protein 
Data Bank (PDB) at http://www.rcsb.org/pdb. These 3D structures included 8 from Kleinberg's 
study [|l^] but excluded the protein fragments and multimeric proteins used in that study. The chosen 
3D structures were then represented by centroids for each side chain calculated from the coordinates 
of each atom in the side chain; in the case of 3D structures solved by NMR, hydrogen atoms were 
included into centroid calculations. For glycine, the centroid was taken to be the position of C a . For 
each side chain, the area of solvent accessible surface was computed via the Web interface of the ASC 



program with default parameters [1C]. In accordance to the GC model, each of the chosen native 
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Name 


Solvent /Length 


Length 


a/ (3 


% Similarity 


Description 


la7m 


51.23 


180 


-295.8 


74.44 


cytokine 


la8y 


81.37 


338 


-155.7 


73.37 


Ca binding protein 


lab3 


399.05 


88 


-326.7 


78.41 


ribosomal protein 


lab7 


451.48 


89 


-79.5 


80.90 


ribonuclease inhibitor 


lagi 


384.30 


125 


-93 


77.60 


endonuclease 


lair 


173.96 


352 


-0.3 


69.89 


pectate lyase 


lb71 


374.23 


191 


-23.7 


69.11 


electron transport 


lble 


498.45 


161 


-269.4 


72.67 


phosphotransferase 


lbpi 


1453.75 


58 


-31.5 


68.97 


proteinase inhibitor 


lbw3 


732.31 


125 


-33.9 


70.40 


lectin 


lclh 


607.48 


166 


-9.3 


69.28 


cyclophilin 


lehs 


2178.29 


48 


-32.1 


72.92 


enterotoxin 


levm 


392.97 


296 


-4.5 


73.99 


phospholipase 


lnar 


447.48 


289 


-6.3 


75.78 


plant seed protein 


lprn 


498.38 


289 


-2.7 


56.40 


porin 


lthv 


741.41 


207 


-10.5 


71.01 


sweet tasting protein 


lxnb 


871.63 


185 


-135.6 


65.95 


glycosidase 


2aak 


1130.39 


150 


-149.1 


78.67 


ubiquitin conjugation 


2bnh 


412.43 


456 


-253.8 


78.51 


ribonuclease inhibitor 


2cba 


773.98 


258 


-2.4 


73.64 


lyase 


2erl 


5064.83 


40 


-47.7 


80.00 


pheromone 


2stv 


1153.80 


184 


-3.6 


64.67 


viral coat protein 


6yas 


904.99 


256 


-12.6 


67.19 


lyase 


8cho 


2054.72 


125 


-423.6 


72.80 


isomerase 


9rat 


2126.95 


124 


-89.4 


75.00 


ribonuclease A 


laai 


51.89 


105 


-155.1 


70.48 (72) 


electron transport 


laba 


125.39 


87 


-245.4 


78.16 (70) 


electron transport 


lbba 


584.25 


36 


-69.6 


66.67 (58) 


pancreatic hormone 


lbrq 


272.54 


174 


-27.3 


72.99 (71) 


retinol transport 


lcis 


780.88 


66 


-454.8 


69.70 (64) 


lysozyme 


lhel 


539.14 


129 


-18.6 


76.74 (78) 


fatty acid bind protein 


lifb 


583.57 


131 


-40.5 


79.39 (70) 


Ca binding protein 


3cln 


835.80 


143 


-309.9 


72.03 (70) 


ribonuclease 


3rn3 


1018.66 


124 


-88.8 


72.58 (69) 


electron transport 



Table 1: This table gives some statistics of our computed fittest H/P protein sequences for the 34 
empirical protein 3D structures chosen from PDB. Column 1 contains the PDB name of the native 
protein sequence, where the proteins from Kleinberg [^] are in bold. Column 2 shows the length 
normalized solvent accessibility, i.e., YlieHlx) s %l n i of the computed fittest protein sequence. Column 
4 lists the protein sequence length n, i.e., the count of amino acids. Column 4 provides the computed 
optimal ratio of the a and (3 parameters (see text). Column 5 displays the percentage of similarity 
of the computed fittest protein sequence to the native protein sequence. Column 6 describes the 
function of the native protein. 
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Figure 1: This plot displays the relationship between the percent similarity of computed fittest protein 
sequences to native protein sequences versus the PFAM family size of native protein sequences. 
Globular native proteins are shown as open circles while nonglobular ones are shown as crosses. 



protein sequences was converted into a binary H/P sequence following Sun et. al. 33], where A, C, 
F, I, L, M, V, W, Y are H, and the other amino acids are P. 

We used Equation || in the GC model to calculate fitness values of protein sequences to determine 
minimal energy values and consequently to compute the "fittest" protein sequences. These fitness 
values consist of two terms in Equation [l]. The first term accounts for the idea that hydrophobic 
residues tend to cluster together due to stacking forces from the solvent. The second term accounts 
for the idea that hydrophobic residues tend to avoid solvent accessible surfaces of the molecule. 
The arbitrary parameters a and (3 represent scaling factors for the relative importance of these two 
tendencies. We expected the appropriate ratio of these two values to depend on the type of a protein 
(globular, nonglobular, and membrane) and the length of the protein. Therefore, we optimized the 
scale of the two parameters to find a ratio that maximizes the similarity of a fittest protein sequence 



to the native protein sequence, similarly in spirit to Kleinberg's scaling algorithm [18|. 

Results of this optimization are shown in Table [I]. As anticipated, our algorithms computed 
fittest protein sequences that are closer to native protein sequences than found by Kleinberg [18|, 
whose results are shown in parenthesis in Table [l]. (Note that protein laaj is an exception to this 
improvement on proximity — perhaps due to the fact that the input data are not exactly the same.) 
However, proximity to native protein sequences is not a good proxy for biological relevance of the 
algorithms. Determination of protein 3D structure involves an energy landscape given by a statistical 
thermodynamic energy function E(X,T>) where X is the set of amino acid sequences and T> is the set 
of possible folded 3D structures. On the one hand, for a fixed protein sequence x, the 3D structure D 
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is determined by a temperature-dependent folding process that minimizes E{x,D) over T>. Ab initio 
solutions to this problem for protein sequences of practical length currently do not exist. On the other 
hand, for a fixed 3D structure D, no known thermodynamic reasons connect the folding process with 
a protein sequence x that minimizes E(x, D) over X. However, we can invoke evolutionary processes 
as possibly selecting for those protein sequences that produce the most stable 3D structures, i.e., 
those with the lowest E(x,D), at a given temperature. In other words, given a suite of protein 
sequences that fold into a particular 3D structure to perform a biological function, there might be 
selection for the most thermodynamically stable protein sequences. 

From the view point of evolutionary selection, the difference between computed fittest protein 
sequences and native protein sequences may be attributed to three factors. First, the GC toy ther- 
modynamic model is inappropriate. Second, the biological function of a protein actually requires 
structural lability. Last, a native protein is part of a diverse family and other members of the family 
lie closer to the computed fittest protein sequence. This last factor can be augmented by the argument 
that if a computed fittest protein sequence exists in nature but is very different from the native protein 
sequence, it is likely that many other protein sequences (thus a diverse family of protein sequences) 
exist in nature and fold into the same or similar 3D structures. All of these factors are likely to play 
in the data shown in Table |l[ However, we conjectured a significant relationship between a computed 
fittest protein sequence's similarity to a native protein sequence and the diversity of the native protein 
in nature. Such a relationship would be highly intriguing biologically. We examined this conjecture 
by assessing the diversity of native proteins using the database PFAM at http://pfam.wustl.edu, 
which is a database of protein families determined through Hidden Markov Models 0j . The database 
contained information on the putative family size of 25 of our 34 chosen native proteins. Figure [l] 
shows the plot of the percent similarity of computed fittest protein sequences to native protein se- 
quences versus the PFAM family size of native proteins. In the figure, globular proteins are shown 
as circles and nonglobular ones as crosses. There is a negative linear trend as suggested by our con- 
jecture. Linear regression is nearly significant at 0.05 level with p = 0.088. The figure shows three 
outliers, 2bnh, 2stv, and 8cho. Of these, 2bnh is an exceedingly strange 3D structure with a protein 
sequence of alpha helixes forming a horseshoe shaped sheet, resulting in a 3D structure that is very 
deviant from globular proteins which are the genesis of the original thermodynamic model. Leaving 
out this outlier results in a significant linear regression with p = 0.015. 

There is still considerable uncertainty about the appropriateness of the GC toy model. The 
average percentage of the hydrophobic residues is 42% in the native protein sequences compared to 
35% in the computed fittest protein sequences. More importantly, the standard deviation of the 
percentage of hydrophobic residues is 0.054 in the native protein sequences compared to 0.143 for 
the computed fittest protein sequences. Thus, the percentage of hydrophobic residues is relatively 
constant in the native protein sequences, reflecting perhaps a functional need or unknown structural 
factors. In contrast, the percentage of hydrophobic residues in a computed fittest protein sequence 
tends to vary depending on the 3D structure. This suggests that it might be important to introduce 
a hydrophobic residue percentage constraint into optimization algorithms in the future as suggested 
in the sliding algorithm of Kleinberg [18]. Nevertheless, our preliminary results show that even such 
a simplified toy model might be useful for exploratory investigations of protein evolution especially 
when coupled to computationally efficient algorithms to allow systematic investigation of the roughly 
13,000 empirical protein 3D structures. We are currently planning a large-scale analysis of further 
empirical protein 3D structures; the results will be reported in a subsequent paper. 
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